Source: https://www.kaggle.com/zynicide/wine-reviews/downloads/wine-reviews.zip/4

My dataset consists of almost 130,000 reviews of individual wines, organized by price, rating on a 0-100 point scale, nationality, type, year, taster, and winery of origin. Wine snobs annoy me, so I wanted to see if anything they have to say about quality holds water statistically.

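library(tidyverse)  # dplyr verbs, the %>% pipe, and ggplot2
library(broom)      # tidy() for model summaries
data <- read.csv(file="winemag-data-130k-v2.csv")
head(data)          # preview the first rows instead of printing the whole table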

Common assertions about wine include a relationship between price and quality, the statement that “X was a good year for Y wine,” and the idea that certain countries make better wines. I’m going to explore these relationships using this dataset.

First, the columns of the table.

colnames(data)
##  [1] "X"                     "country"              
##  [3] "description"           "designation"          
##  [5] "points"                "price"                
##  [7] "province"              "region_1"             
##  [9] "region_2"              "taster_name"          
## [11] "taster_twitter_handle" "title"                
## [13] "variety"               "winery"

Some of these can be removed for clarity. X is just a row ID and is unnecessary here, while description and designation are free-text fields that this analysis does not use.

data <- data %>% select(-X, -description, -designation)

This removes those three columns from the table.

Next, what makes a good wine according to the data? Computing the mean rating by country of origin is easy enough.

mean_score_nationality <- data %>%
  group_by(country) %>%
  summarize(score = mean(points)) %>% arrange(desc(score))  # highest mean first
mean_score_nationality

According to this, England produces the best wine on average, but a graphical aid would better display the differences between countries.

mean_score_nationality %>% ggplot(aes(x=country, y=score)) +
  geom_col(width=.5) + labs(x="country", y="mean rating")

This shows that while differences exist in ratings by nationality, the typical magnitude of that difference is relatively small. The power of data science is that a few lines of code can reduce an otherwise-insurmountable quantity of measurements to a graphic or figure parseable by the unaided eye.
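One caveat when ranking countries by mean rating: a mean is only as trustworthy as the number of reviews behind it, so it can help to show the count next to the score. Here is a minimal sketch on an invented toy table; the `toy` data frame and the `n_reviews` column name are assumptions for illustration, not part of the wine dataset.

```r
library(dplyr)

# Toy stand-in for the wine table; these rows are invented for illustration.
toy <- data.frame(
  country = c("A", "A", "A", "A", "B"),
  points  = c(88, 90, 86, 88, 93)
)

# Report each country's mean rating alongside the number of reviews behind it.
country_summary <- toy %>%
  group_by(country) %>%
  summarize(score = mean(points), n_reviews = n()) %>%
  arrange(desc(score))

print(country_summary)
```

In this toy table, country B tops the ranking on the strength of a single review, which is exactly the situation the extra column is meant to expose.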

How does price affect rating? This question is better suited to a linear regression of the data to find the relationship between the two continuous variables.

data %>% ggplot(aes(x=price,y=points))+geom_point()+geom_smooth(method=lm)+labs(x="price", y="rating")
## Warning: Removed 8996 rows containing non-finite values (stat_smooth).
## Warning: Removed 8996 rows containing missing values (geom_point).
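The warnings say that 8996 rows were dropped from the plot, most likely because their price is missing (NA). A quick way to check how many rows lack a price, sketched here on an invented toy table (`toy` and `toy_priced` are hypothetical names):

```r
# Toy stand-in for the wine table; values are invented for illustration.
toy <- data.frame(
  points = c(87, 90, 85, 92),
  price  = c(15, NA, 20, NA)
)

# Count the rows whose price is missing.
n_missing <- sum(is.na(toy$price))
print(n_missing)

# Keep only the rows that have a price before plotting or fitting.
toy_priced <- toy[!is.na(toy$price), ]
print(nrow(toy_priced))
```

Running the same `sum(is.na(...))` check on the real price column should show whether the count matches the number in the warnings.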

The regression line on the graph looks off, so let’s gather some information about the relationship.

lmfit <- lm(points ~ price, data)
lmfit
## 
## Call:
## lm(formula = points ~ price, data = data)
## 
## Coefficients:
## (Intercept)        price  
##    87.32964      0.03089
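To get a feel for what these coefficients imply, it helps to plug a few prices into the fitted line, rating = intercept + slope × price. The sketch below copies the two coefficients from the output above; the helper `predict_rating` and the three example prices are made up for illustration.

```r
# Coefficients copied from the lm() output above.
intercept <- 87.32964
slope     <- 0.03089

# Predicted rating for a given price under the fitted line.
predict_rating <- function(price) intercept + slope * price

# Prices chosen purely for illustration.
prices <- c(20, 100, 1000)
predicted <- predict_rating(prices)
print(data.frame(price = prices, predicted_rating = predicted))
```

The prediction for a price of 1000 comes out above 100, which is outside the rating scale, a hint that a straight line is a poor description at the expensive end.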

The linear regression predicts that each one-unit increase in price raises the expected rating by about 0.03 points. However, in the graph from earlier, the bulk of the data appeared to sit well below the line. The broom package's tidy() function summarizes a fitted model as a data frame of coefficient estimates, standard errors, and p-values, which helps gauge how strong the relationship is.

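tidy(lmfit)  # linear fit: estimate, std.error, statistic, p.value per term
# Poisson GLM for comparison; its coefficients are on the log scale
glmfit <- glm(points ~ price, data = data, family = "poisson")
tidy(glmfit)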
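The two tables are not directly comparable: the linear model's slope is an additive change in points, while a Poisson GLM uses a log link by default, so its slope has to be exponentiated to read it as a multiplicative change. Here is a minimal sketch on simulated data; the `toy` data frame and the simulation parameters are assumptions and say nothing about the wine results.

```r
# Simulated stand-in data; not the wine dataset.
set.seed(1)
toy <- data.frame(price = runif(200, min = 5, max = 100))
toy$points <- rpois(200, lambda = exp(4.4 + 0.001 * toy$price))

lin_fit  <- lm(points ~ price, data = toy)
pois_fit <- glm(points ~ price, data = toy, family = poisson)

# Linear slope: additive change in points per unit of price.
print(coef(lin_fit)["price"])

# Poisson slope is on the log scale; exponentiate it to get the
# multiplicative change in expected points per unit of price.
rate_ratio <- exp(coef(pois_fit)["price"])
print(rate_ratio)
```

A `rate_ratio` of 1.001, for example, would mean the expected rating grows by about 0.1% per unit of price.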